Utility of general and specific word embeddings for classifying translational stages of research

Authors

  • Vincent Major
  • Alisa Surkis
  • Yindalon Aphinyanagphongs
Abstract

Conventional text classification models make a bag-of-words assumption, reducing text, fundamentally a sequence of words, to word occurrence counts per document. Recent algorithms such as word2vec and fastText are capable of learning semantic meaning and similarity between words in an entirely unsupervised manner using a contextual window, and of doing so much faster than previous methods. Each word is represented as a vector such that words with similar meanings, such as “strong” and “powerful”, lie in the same general region of Euclidean space. Open questions about these embeddings include their usefulness across classification tasks and the optimal set of documents on which to build the embeddings. In this work, we demonstrate the usefulness of embeddings for improving the state of the art in classification for our tasks, and demonstrate that specific word embeddings built in the domain and for the tasks can improve performance over general word embeddings (learned on news articles, Wikipedia or ...
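The geometric intuition in the abstract — words used in the same contexts end up with nearby vectors — can be sketched without word2vec itself. The toy below uses a count-based co-occurrence matrix factored with SVD rather than word2vec's predictive training, and the corpus and function names are invented for illustration; it shows "strong" and "powerful" receiving essentially identical vectors because they occur in identical contexts:

```python
import numpy as np

def cooccurrence_vectors(sentences, window=2, dim=2):
    """Build word vectors by SVD of a co-occurrence matrix.

    A simple count-based stand-in for word2vec: both yield dense
    vectors in which words sharing contexts land near each other."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            # Count every word within `window` positions of w.
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    counts[idx[w], idx[s[j]]] += 1
    U, S, _ = np.linalg.svd(counts, full_matrices=False)
    # Keep the top `dim` components, scaled by their singular values.
    return {w: U[idx[w], :dim] * S[:dim] for w in vocab}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

corpus = [
    "the strong man lifted the box".split(),
    "the powerful man lifted the box".split(),
    "the red box sat on the table".split(),
]
vecs = cooccurrence_vectors(corpus)
# "strong" and "powerful" appear in identical contexts, so their
# co-occurrence rows (and hence their vectors) coincide.
print(cosine(vecs["strong"], vecs["powerful"]))
```

word2vec and fastText learn comparable geometry by prediction over a sliding window instead of explicit counting, which is what makes them fast on large corpora.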


Similar resources

Sentiment Analysis of Citations Using Word2vec

Citation sentiment analysis is an important task in scientific paper analysis. Existing machine learning techniques for citation sentiment analysis focus on labor-intensive feature engineering, which requires a large annotated corpus. As an automatic feature extraction tool, word2vec has been successfully applied to sentiment analysis of short texts. In this work, I conducted empirical res...


A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is a great deal of news spreading on the web; a text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...
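The first step this snippet describes — representing a whole document as a vector — is most simply done by averaging pre-trained word vectors, after which any vector-space classifier applies. A minimal sketch (the word vectors, labels, and nearest-centroid rule below are invented for illustration and are not the paper's proposed method):

```python
import numpy as np

# Toy "pre-trained" word vectors (hypothetical values, 2-D for clarity).
word_vecs = {
    "stocks": np.array([1.0, 0.1]), "market": np.array([0.9, 0.2]),
    "earnings": np.array([0.8, 0.0]),
    "goal": np.array([0.1, 1.0]), "match": np.array([0.0, 0.9]),
    "team": np.array([0.2, 0.8]),
}

def doc_embedding(tokens):
    """Average the vectors of in-vocabulary words: the simplest
    document embedding built from word embeddings."""
    vs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vs, axis=0)

# One labeled example per category defines a class centroid.
train = {"finance": "stocks market earnings".split(),
         "sports": "goal match team".split()}
centroids = {label: doc_embedding(toks) for label, toks in train.items()}

def classify(tokens):
    """Assign the label whose centroid has highest cosine similarity."""
    v = doc_embedding(tokens)
    return max(centroids, key=lambda c: float(
        centroids[c] @ v / (np.linalg.norm(centroids[c]) * np.linalg.norm(v))))

print(classify("market earnings".split()))  # nearest-centroid label
```

Averaging discards word order, so more elaborate document embeddings (e.g. weighted averages or learned document vectors) often perform better; this sketch only fixes the interface.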


Do We Need Discipline-Specific Academic Word Lists? Linguistics Academic Word List (LAWL)

This corpus-based study aimed at exploring the most frequently used academic words in linguistics and comparing the word list with the distribution of high-frequency words in Coxhead’s Academic Word List (AWL) and West’s General Service List (GSL) to examine their coverage within the linguistics corpus. To this end, a corpus of 700 linguistics research articles (LRAC), consisting of approximately ...


Adapting Pre-trained Word Embeddings For Use In Medical Coding

Word embeddings are a crucial component in modern NLP. Pre-trained embeddings released by different groups have been a major reason for their popularity. However, they are trained on generic corpora, which limits their direct use for domain specific tasks. In this paper, we propose a method to add task specific information to pre-trained word embeddings. Such information can improve their utili...
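One well-known way to add task-specific information to pre-trained embeddings is retrofitting (Faruqui et al., 2015), which pulls each vector toward its neighbours in a domain lexicon while anchoring it to the original. The sketch below is a simplified version of that update with invented clinical terms; the paper above may use a different method entirely:

```python
import numpy as np

def retrofit(vectors, lexicon, iters=10, alpha=1.0, beta=1.0):
    """Simplified retrofitting: each word's vector becomes a weighted
    mean of its original vector (weight alpha) and its current lexicon
    neighbours (weight beta each), iterated to convergence."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iters):
        for w, nbrs in lexicon.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            new[w] = (alpha * vectors[w] + beta * sum(new[n] for n in nbrs)) \
                     / (alpha + beta * len(nbrs))
    return new

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical terms: generic training left these synonyms far apart.
vecs = {"mi": np.array([1.0, 0.0]),          # "myocardial infarction"
        "heart_attack": np.array([0.0, 1.0])}
lexicon = {"mi": ["heart_attack"], "heart_attack": ["mi"]}

fitted = retrofit(vecs, lexicon)
print(cos(fitted["mi"], fitted["heart_attack"]))  # rises above the original 0.0
```

The lexicon is where the task-specific signal enters: here it could come from a medical-coding ontology mapping codes to synonymous surface forms.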


A classification of hull operators in archimedean lattice-ordered groups with unit

The category, or class of algebras, in the title is denoted by $\bf W$. A hull operator (ho) in $\bf W$ is a reflection in the category consisting of $\bf W$ objects with only essential embeddings as morphisms. The proper class of all of these is $\bf hoW$. The bounded monocoreflection in $\bf W$ is denoted $B$. We classify the ho's by their interaction with $B$ as follows. A "word" is a function ...



Journal:
  • CoRR

Volume: abs/1705.06262  Issue: -

Pages: -

Publication year: 2017